Method and apparatus for mRNA assembly

ABSTRACT

A method of comparing nucleic acid sequences being ESTs included in a first database of sequences and nucleic acid sequences included in a second database of sequences to form groups of sequences from the two databases that all relate to the same gene. For each one or more n-groups of sequences of one of the two databases, associating therewith lists of nucleic acid sequences, each from one of said two databases, each sequence on the list containing the n-groups, and matching sequences on the lists to generate said group.

FIELD OF THE INVENTION

The present invention relates to automatic assembly of mRNA sequences from databases containing large numbers of partial cDNA sequences.

BACKGROUND OF THE INVENTION

In human cells, genetic material is stored as DNA in the nucleus of the cell. When a certain protein is needed by the cell, a portion of the DNA is transcribed as mRNA, which is transported to the cytoplasm of the cell. In the cytoplasm, ribosomes create proteins, using the mRNA as a template. Generally, the mRNA comprises a long sequence of bases, each triplet (codon) of which encodes a specific amino acid. Thus, a sequence of triplets encodes a sequence of amino acids, which form a protein.

Cell function can, theoretically, be analyzed by determining the types of proteins in the cell and the ratios between them. However, proteins are very delicate materials, which are difficult to analyze. mRNA, which controls the creation of the proteins, is easier to separate and analyze. Although several different mRNA sequences may encode similarly acting proteins, each mRNA sequence encodes only a single protein. In addition, there is usually a good correlation between the relative amounts of different types of mRNA and the relative amounts of protein. It is thus possible to analyze cell function by analyzing the mRNA in a cell.

It should be noted that mRNA contains two types of information which are not evident from DNA. First, the relative concentration of the mRNA indicates the abundance of a particular protein. Second, in the process of transcribing DNA, changes, especially deletions, are made to the nucleotide sequence.

Differential analysis is used to generate standardized databases of human cellular activity by determining differences between gene expression in sick cells and healthy cells and between cells from different tissues. The result of a differential analysis between two cells is the difference in the type and expression level of mRNA sequences. In some cells, for example cancer cells, there is a higher concentration of certain proteins than in healthy cells of the same tissue. Determining these differences can help researchers determine how a cancer cell functions differently from healthy cells. Analysis of mRNA is currently being used to generate drug leads. For example, by selectively blocking those proteins which are more common in cancer cells, using designer pharmaceuticals, it may be possible to disrupt the functioning of cancer cells without significantly affecting the functionality of regular cells. Also, when developing pharmaceuticals for bacterial, prion and viral infections, it is useful to design a pharmaceutical which selectively blocks proteins which are necessary for the life and/or reproduction of the disease agent, but which does not block proteins necessary for human cell survival.

Thus, it can easily be appreciated why pharmaceutical companies, research institutes and biotechnology companies maintain large databases of partial mRNA sequences. Such sequences, known as ESTs (Expressed Sequence Tags), often have associated information, such as the tissue type and/or disease type where the EST is expressed and/or the expression level of the EST in these situations. Some databases include complete mRNA sequences. In some cases, a genomic database can be analyzed to yield mRNA sequences, if the introns are correctly identified.

ESTs are generated using the following (greatly simplified) process: a cell is selected and disrupted; proteins and other cell structures are selectively disintegrated; mRNA sequences are isolated and converted to cDNA sequences; cDNA sequences are inserted into host cells, which can be cultured; individual host cells are disrupted; and a segment of DNA which includes the cDNA or original mRNA sequence at a known location thereof is located and read out.

Unfortunately, the art of reading mRNA sequences is not yet completely developed. The error rate of the reading increases with increasing length of the mRNA sequence. The common errors are insertion or deletion of bases, and errors in the identification of individual bases. At a certain sequence length, the error rate increases to a point where further reading is not possible. As a result, most ESTs are only 200-600 bases long, while an average mRNA sequence is typically 1000-3000 bases long.

In addition, EST databases contain many other types of errors, which may be accumulated during the complicated process of EST generation, in addition to features inherent in the mRNA which make the assembly difficult. These causes of difficulty include:

(a) Chimeric sequences. During the process of extracting and replicating the mRNA and cDNA, chimeric sequences may be inadvertently inserted into the nucleotide sequences. Such chimeric sequences include ribosomal RNA, junk sequences from the extraction and replication process, contamination from external sources, such as human cells, and contamination from the host cells.

(b) Intron Contamination. Introns are portions of the DNA which are not expressed in the final mRNA product and are usually removed from the mRNA during the middle of the transcription process (splicing). However, since the cell is disrupted in the middle of its normal activity, the transcription process may be incomplete or otherwise disrupted, for example by introns being incorporated in the mRNA sequences.

(c) Broken and respliced sections. During the process of extraction and replication the mRNA sequences may be broken and, in some cases, may be reconnected, not necessarily correctly. In addition, whole sections of mRNA sequences may be inadvertently removed.

(d) Alternative splicing. This is not an error in the ESTs but it is an important cause of mismatch between ESTs. The transcription of DNA to mRNA does not follow a one-to-one correspondence. Depending on various conditions in the cell, a single DNA sequence may be transcribed as several different mRNA sequences. The different transcriptions, named alternative splice variants, are usually achieved by certain segments of the DNA being selectively spliced out. Thereafter, selected portions of the mRNA, named alternative spliced regions, are selectively spliced out of the mRNA sequence. As a result, there may be two mRNA sequences which do not exactly match, even though they originate from the same DNA sequence and contain no errors.

(e) Redundancy Level. The process of extracting the ESTs includes replication of mRNA sequences and there is usually more than one copy of each mRNA in a living cell. In addition, as most databases contain ESTs extracted in many experiments, many ESTs can be expected to appear in several experiments. As a result, there is a high redundancy of ESTs in the raw database. However, due to the errors in reading out the ESTs, the ESTs will not exactly match. Also, even though there may be significant overlap between two or more ESTs, they will usually have different start and end points and different lengths. This lack of consistency makes the task of assembly more difficult.

As an end result, EST databases generally contain only short ESTs, which must then be correctly associated and assembled into the original mRNA sequences. However, due to the above-described problems, it is very difficult to correctly match up the ESTs. In general, the limiting factor in this field is information analysis, rather than information volume.

If the ESTs are correctly matched, the discovery and/or development of new pharmaceuticals is made easier and faster. For example, assuming 20 ESTs are determined by differential analysis to be found in a cancer cell rather than a healthy cell, 20 leads must be pursued to find a drug which may disrupt the cancer cell. However, if the 20 ESTs are combined to form 2 complete mRNA sequences, only 2 leads need to be pursued, reducing the volume of work by a factor of 10.

SUMMARY OF THE INVENTION

It is an object of some embodiments of the present invention to provide a method of mRNA assembly which reduces existing raw EST databases, removes errors therefrom and facilitates the creation of longer and/or complete mRNA sequences. The desired end result is a reduced database in which each mRNA sequence and/or EST encodes a different protein. At least, the ratio between the number of ESTs and the number of proteins should be reduced as much as possible. Two types of errors should preferably be avoided and/or corrected: incorrect mRNA sequences and errors of omission, where a real difference between two mRNA sequences is lost, due to the method of reducing the raw database.

It is another object of some embodiments of the present invention to provide a method of discovering hitherto unknown complete mRNA sequences and/or genes.

It is another object of some embodiments of the present invention to provide a method of modeling and discovering alternatively spliced mRNA sequences.

It is another object of some embodiments of the present invention to provide a method of EST association and/or assembly which has a lower computational complexity than existing methods and is therefore suitable for the analysis of huge databases of ESTs.

In accordance with a preferred embodiment of the present invention, a process of database reduction and/or analysis includes:

(a) correcting obvious errors in ESTs;

(b) clustering ESTs which appear to originate from the same mRNA sequence;

(c) assembling ESTs into mRNA sequences;

(d) comparing the assembled mRNA sequences to protein databases; and

(e) comparing the assembled mRNA sequences to genome databases.

The order of (a)-(e) is not fixed. For example, error correction may be performed at any stage. Further, the process is preferably iterative, with later steps affecting earlier steps.

One aspect of some embodiments of the present invention relates to using a method that directly compares a database with a database, rather than a method that compares an individual EST with a database. As a result, a more efficient analysis algorithm can be developed. In accordance with a preferred embodiment of the invention, an algorithm whose complexity is near O(k(N)×N), where N is the number of ESTs and k is a slowly increasing function of N, rather than O(N²), is provided. In huge EST databases, this difference is extremely important and may pave the way to using mRNA analysis of cells from biopsies to diagnose individuals in a short time.

Another aspect of some embodiments of the present invention relates to a method of clustering ESTs. Rather than force a long segment of one EST to match a second segment of a second EST, only certain annotated portions of the ESTs are matched. In a preferred embodiment of the invention, short segments, preferably 9 bases long, are used for the matching. An index is generated which lists, for each 9 base sequence (n-group), all the ESTs which contain that sequence. The list associated with each indexing n-group may then be treated as an individual (smaller) database. If the component database is small enough, it may be preferred to use brute force methods to find matches within the component database. Alternatively or additionally, at least larger ones of the component databases may be reindexed using the same method. Preferably, during such reindexing additional limitations are applied, for example, that the order of appearance of the n-groups is the same in the matched ESTs, or by indexing (and matching) only the n-groups which are either consecutive, 1 or 2 bases away from the indexing n-group. Typically, ESTs are clustered when they contain 4 matching n-groups.
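The indexing step described above can be illustrated with a minimal Python sketch. The function name build_ngroup_index, the EST identifiers and the toy sequences are assumptions made for illustration only; the sketch lists, for each 9 base n-group, the ESTs that contain it, and does not perform the reindexing or ordering constraints.

```python
from collections import defaultdict

def build_ngroup_index(ests, n=9):
    """Map every n-base substring (n-group) to the set of EST ids containing it.

    `ests` is a dict of hypothetical EST id -> base sequence.
    """
    index = defaultdict(set)
    for est_id, seq in ests.items():
        # Slide a window of n bases along the EST, one base at a time.
        for i in range(len(seq) - n + 1):
            index[seq[i:i + n]].add(est_id)
    return index

# Toy data (invented): est1 and est2 overlap in the 12 bases TTAGCCGATAGG.
ests = {
    "est1": "ACGTACGTTAGCCGATAGG",
    "est2": "TTAGCCGATAGGCTTAACG",
    "est3": "GGGGCCCCAAAATTTTGGC",
}
index = build_ngroup_index(ests)

# Each list with more than one EST is a small "component database".
shared = {ngroup: ids for ngroup, ids in index.items() if len(ids) > 1}
```

In this toy example only est1 and est2 share n-groups, so the lists in shared are the only component databases in which any matching would need to be attempted.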

It should be appreciated that the size of the indexing base sequence may be a number other than 9, although 9 appears to be suitable for raw databases of 100,000-1,000,000 ESTs of an average length of 400 bases. The length of the n-group may also be different for different iterations of the method. It should be appreciated that, in general, longer indexing sequences are more sensitive to errors in the reading of the mRNA sequences; however, they provide better matches. Further, the number of n-groups that must match is also a parameter, which may vary depending on the original database size, error rate and redundancy level. Further, the number of bases allowed between two consecutive n-groups is also a parameter, which may vary responsive to the database characteristics and the efficiency of the algorithm. In a preferred embodiment of the invention, each EST is graded as to its suitability to be included in a certain cluster. In some cases, an EST may be suitable for two clusters, especially if the two clusters are really a single cluster. In addition, externally provided data, such as the information that two ESTs are probably from the same mRNA sequence, can also affect the grade. Also, the number of detected and/or corrected errors in a particular EST and/or in the original database as a whole may affect the grading process.
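The role of the two parameters discussed above (n-group length and the number of n-groups that must match) can be sketched as follows. This is a simplified illustration rather than the preferred embodiment: it counts distinct shared n-groups by brute force within each index list and merges ESTs with a union-find structure, and it omits recursive reindexing, ordering constraints and grading. The names cluster_ests and min_shared are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def cluster_ests(ests, n=9, min_shared=4):
    """Cluster ESTs that share at least `min_shared` distinct n-groups."""
    # Index: n-group -> set of EST ids containing it.
    index = defaultdict(set)
    for est_id, seq in ests.items():
        for i in range(len(seq) - n + 1):
            index[seq[i:i + n]].add(est_id)

    # Count shared n-groups for every pair appearing in the same list.
    shared_counts = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared_counts[(a, b)] += 1

    # Union-find: merge pairs that reach the threshold.
    parent = {est_id: est_id for est_id in ests}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), count in shared_counts.items():
        if count >= min_shared:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for est_id in ests:
        clusters[find(est_id)].add(est_id)
    return sorted(sorted(members) for members in clusters.values())

ests = {
    "est1": "ACGTACGTTAGCCGATAGG",
    "est2": "TTAGCCGATAGGCTTAACG",
    "est3": "GGGGCCCCAAAATTTTGGC",
}
clusters_loose = cluster_ests(ests, n=9, min_shared=4)
clusters_strict = cluster_ests(ests, n=9, min_shared=5)
```

With the toy data, est1 and est2 share exactly four 9 base n-groups, so they are clustered at a threshold of 4 and kept apart at a threshold of 5, which shows how the threshold trades sensitivity against the risk of false matches.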

Another aspect of some embodiments of the present invention relates to a method of assembly of clustered ESTs into mRNA sequences using graphs. Each unique segment of an EST is associated with a node in a directed graph. The allowed transitions between nodes are restricted based on the “transitions” found in the ESTs that comprise the cluster. In accordance with a preferred embodiment of the invention, the resulting graph is analyzed to determine errors. For example, if there is more than one end node in the graph, this may be indicative of a chimeric sequence. Also, an end node which is too close to a start node (measured by the number of bases between them) is usually indicative of a problem. End nodes may be defined as nodes whose segments contain stop codons and/or as nodes which have no transitions thereafter. Alternative paths in the graph, in which both a direct transition and an indirect transition between two nodes are available, usually identify alternative spliced regions. In a preferred embodiment of the invention, mRNA sequences with one, two, three, four or even more alternative spliced regions are correctly identified. Thus, a large number of possible alternative spliced variants, for a single mRNA sequence, may be identified in a single tissue type. Generally, the larger the ratio between ESTs and mRNA sequences, the better the identification of alternative spliced regions (and of errors in the sequence). Further, some preferred embodiments of the invention can also identify exclusive alternative splices, where each alternative spliced variant of the mRNA sequence contains a segment that does not appear in other variants.
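The graph analysis described above can be sketched in Python under the assumption that each EST in a cluster has already been decomposed into an ordered list of segment labels. The labels "A" to "D" and "X", and the function names, are hypothetical. End nodes are taken here to be nodes with no outgoing transition, and a candidate alternative spliced region is reported wherever a direct transition coexists with an indirect path between the same two nodes.

```python
from collections import defaultdict

def build_graph(est_segments):
    """Nodes are segments; an edge a->b means some EST has a directly followed by b."""
    nodes, edges = set(), defaultdict(set)
    for segments in est_segments:
        nodes.update(segments)
        for a, b in zip(segments, segments[1:]):
            edges[a].add(b)
    return nodes, edges

def end_nodes(nodes, edges):
    """Nodes with no transitions thereafter."""
    return sorted(node for node in nodes if not edges.get(node))

def reachable(edges, start, target):
    """Depth-first search: is `target` reachable from `start`?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, ()))
    return False

def direct_and_indirect(nodes, edges):
    """Pairs (u, v) joined both by a direct edge and by a longer path."""
    found = []
    for u in sorted(nodes):
        successors = sorted(edges.get(u, ()))
        for v in successors:
            if any(w != v and reachable(edges, w, v) for w in successors):
                found.append((u, v))
    return found

# Toy cluster (invented): the third EST skips segment B.
cluster = [["A", "B", "C"], ["B", "C", "D"], ["A", "C", "D"]]
nodes, edges = build_graph(cluster)
ends = end_nodes(nodes, edges)
splices = direct_and_indirect(nodes, edges)

# Adding an EST that branches off to an unrelated segment X yields a second
# end node, the pattern the text associates with a possible chimeric sequence.
nodes2, edges2 = build_graph(cluster + [["C", "X"]])
ends2 = end_nodes(nodes2, edges2)
```

In the toy cluster the pair ("A", "C") is reported, which corresponds to segment B being present in some ESTs and skipped in another.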

Another aspect of some embodiments of the present invention relates to using feedback from one step of the above-described process to affect a different step. In one example, an error in the assembly step, such as the discovery of a chimeric sequence, may be used to change the clustering, by disallowing all matches based on the identified chimeric sequence. A chimeric sequence may be identified by matching the assembled mRNA sequence to a database of known contaminants. Preferably, only suspected chimeric sequences are tested by comparison to a database of contaminants, at the assembly stage. Suspicious sequences are preferably determined from the morphology of the graph. Another example is correcting errors in ESTs based on the assembly. Such corrected errors may also be propagated back to the clustering step.

Another aspect of the present invention relates to using an mRNA assembly method as a part of a diagnostic device. Such a device will receive as an input a readout of ESTs, assemble the ESTs into mRNA sequences, correct errors in the sequences and then analyze the resulting mRNA expression spectra and/or compare it to known disease templates to diagnose the disease. Such an input may, in some cases, be of relatively low quality.

Another aspect of some embodiments of the present invention is related to diagnosing diseases and cellular dysfunction based on an analysis of relative expression levels of alternative spliced variants in a single tissue type.

Another aspect of the present invention relates to DNA chip design. Correct selection of DNA sequences to place on a DNA chip is limited by the uncertainty of the relative importance and association of different ESTs. Once the ESTs are assembled into mRNA sequences, it is possible to select one or more sets of DNA segments which will be most useful for the DNA matching task. The high degree of automation possible with, and the quality of, mRNA sequence determination in accordance with preferred embodiments of the present invention make such an analysis for DNA chip design a reality. Such a set can also take into account alternative splicing and/or the types and distributions of different errors in the EST database. Thus, a DNA chip can be made more robust for a particular application. In one preferred embodiment of the invention, the indexing method is used to generate an index of all the short segments of nucleotides in the mRNA sequences of interest. The length of the short segments is determined based on the design constraints of the DNA chip. The number of short segments necessary to correctly identify a single mRNA sequence (or DNA sequences, in genomic applications) can be determined by the number of re-indexing steps required to isolate that sequence in a database. The utilization of a DNA chip can be maximized by selecting only mRNA sequences which can be identified using a minimal number of short DNA sequences.

There is therefore provided in accordance with a preferred embodiment of the invention, a method of obtaining an mRNA sequence having alternative spliced variants from a database of ESTs, comprising:

providing a raw database comprising a plurality of ESTs; and

assembling ones of said ESTs into mRNA sequences, wherein said assembling includes identifying alternative spliced regions.

Preferably, the method includes clustering ESTs which have matching segments and wherein said assembling comprises assembling ESTs which are clustered together.

Alternatively or additionally, the method includes correcting errors in said ESTs.

There is also provided in accordance with a preferred embodiment of the invention, an mRNA sequence determined by the above described processes. Preferably, the sequence comprises at least two alternative spliced regions. Alternatively, the sequence comprises at least three alternative spliced regions. Alternatively, the sequence comprises at least four alternative spliced regions. Alternatively or additionally, the sequence represents at least two alternative spliced variants of mRNA sequence, each variant utilizing at least one mutually exclusive alternative splice region. Alternatively, the sequence represents at least three alternative spliced variants of mRNA, each variant utilizing at least one mutually exclusive alternative splice region. Alternatively or additionally, the sequence represents at least four alternative spliced variants of mRNA, each variant utilizing at least one mutually exclusive alternative splice region. Alternatively or additionally, the mRNA sequence is obtained from a single tissue type.

There is also provided in accordance with a preferred embodiment of the invention, a method of tissue analysis comprising:

providing a biological sample;

determining relative expression levels of different variants of mRNA sequences in the biological sample which contain alternative spliced regions, to determine a spectra of relative expression of alternative spliced variants; and

analyzing said spectra to determine disease in the sample.

Preferably, analyzing comprises comparing said spectra against predetermined spectra. Alternatively or additionally, determining relative expression levels comprises:

analyzing said sample to detect ESTs; and

assembling said ESTs into mRNA sequences having alternative spliced regions.

There is also provided in accordance with a preferred embodiment of the invention a diagnostic device comprising:

an input for receiving EST expression levels; and

a spectra generator which generates a spectra of mRNA expression levels responsive to said EST input.

Preferably, the spectra generator generates a spectra of relative expression levels of different variants of mRNA sequences containing alternative spliced regions. Alternatively or additionally, the device comprises a database containing expression spectra corresponding to a plurality of disease states. Preferably, the device comprises a comparator which compares the generated spectra with spectra in the database to determine a disease state in the tissue which originated the ESTs.

There is also provided in accordance with a preferred embodiment of the invention, a method of clustering a plurality of ESTs, comprising:

indexing n-groups in the ESTs, to generate lists of ESTs which contain each particular n-group indexed; and

matching ESTs within each list to generate clusters.

Preferably, matching ESTs comprises indexing n-groups in each of said lists to generate secondary lists. Preferably, the method comprises recursively applying said indexing until recursively created secondary lists include ESTs containing at least three n-group matches. Alternatively, the method comprises recursively applying said indexing until recursively created secondary lists include ESTs containing at least four n-group matches. Alternatively, the method comprises recursively applying said indexing until recursively created secondary lists include ESTs containing at least five n-group matches. Alternatively or additionally, recursively applying said indexing comprises recursively indexing only n-groups which are distanced from the first indexed n-group less than a certain number of bases. Preferably, the number of bases is less than five. Alternatively, the number of bases is less than four. Alternatively, the number of bases is less than three.

Alternatively or additionally, matching comprises correlating said ESTs using an SW (Smith-Waterman) algorithm, modified to include detection of long gaps.

Alternatively, matching comprises correlating said ESTs using an SW (Smith-Waterman) algorithm.

In a preferred embodiment of the invention, said indexing comprises ignoring certain n-groups.

Preferably, the indexed n-groups are 9 bases long. Preferably, the n-groups are between 5 and 15 bases long.

In a preferred embodiment of the invention the clustering method includes merging clusters. Preferably, merging clusters comprises merging responsive to an assumed error distribution in said ESTs.

There is also provided in accordance with a preferred embodiment of the invention, a method of mRNA assembly from a plurality of ESTs, comprising:

determining a correspondence between segments in each EST; and

generating a directed graph in which each node represents a single segment, and each transition between two nodes represents the existence of an EST in which the two corresponding segments are consecutive.

Preferably, the method comprises clustering said ESTs into clusters of associated ESTs, wherein said determining a correspondence is performed on individual clusters of ESTs. Alternatively or additionally, the method comprises identifying alternative spliced regions from said graph based on the morphology of the graph. Alternatively or additionally, the method comprises correcting errors in said ESTs based on the morphology of said graph. Preferably, the method comprises repeating said clustering responsive to said corrected errors.

There is also provided in accordance with a preferred embodiment of the invention a method of identifying errors in mRNA sequences, comprising:

generating a graph which represents the assembly of segments of ESTs into an mRNA sequence; and

analyzing said graph to determine unusual configurations of said graph.

Preferably, said analyzing comprises identifying multiple end-nodes in said graph.

There is also provided in accordance with a preferred embodiment of the invention a method of tuning a database reduction process, comprising:

applying the database reduction process, with a certain value for at least one parameter, to a sample database;

determining a reduction ratio in the database; and

reapplying said process with a new value for said at least one parameter if a desired reduction ratio is not achieved.

Preferably, the at least one parameter comprises the length of n-groups used in matching two ESTs.
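The tuning loop above may be sketched as follows. The text does not define the reduction ratio or the reduction process in code terms, so this sketch makes two loudly stated assumptions: the ratio is taken as the number of input sequences divided by the number of groups remaining, and the reduction process is replaced by a toy stand-in (toy_reduce) that merely groups sequences by their first n bases. The names tune_parameter and toy_reduce are hypothetical.

```python
def tune_parameter(reduce_fn, sample_db, candidate_values, target_ratio):
    """Try each candidate parameter value until the target reduction ratio is met.

    Returns (value, ratio) for the first value that achieves the target,
    or (None, None) if no candidate does.
    """
    for value in candidate_values:
        reduced = reduce_fn(sample_db, value)
        ratio = len(sample_db) / len(reduced)  # assumed definition of the ratio
        if ratio >= target_ratio:
            return value, ratio
    return None, None

def toy_reduce(db, n):
    """Stand-in for a real reduction process: group sequences by first n bases."""
    return {seq[:n] for seq in db}

sample = ["ACGTAA", "ACGTCC", "ACGGTT", "TTTTAA"]
best_n, best_ratio = tune_parameter(toy_reduce, sample, [5, 4, 3], target_ratio=2.0)
```

In practice reduce_fn would wrap the clustering and assembly steps, and the parameter could equally be the number of n-groups that must match rather than the n-group length.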

There is also provided in accordance with a preferred embodiment of the invention a method of iterative clustering of ESTs, comprising:

clustering ESTs;

assembling clustered ESTs; and

re-clustering the ESTs responsive to errors detected in the ESTs after said clustering.

There is also provided in accordance with a preferred embodiment of the invention, a method of iterative clustering of ESTs, comprising:

deciding if two ESTs match, responsive to predetermined error probabilities of errors in said ESTs;

clustering said ESTs responsive to said match;

correcting said predetermined error probabilities, responsive to further processing of said ESTs; and

repeating said deciding and said clustering responsive to said corrected error probabilities.

There is also provided in accordance with a preferred embodiment of the invention a method of EST database processing, comprising:

analyzing said ESTs to detect errors;

further processing said ESTs to create mRNA sequences;

determining, responsive to said further processing, corrections for said errors; and

correcting said errors.

Preferably, said further processing comprises assembling said ESTs into mRNA sequences.

There is also provided in accordance with a preferred embodiment of the invention, a method of designing a DNA chip based on an EST set determined by differential analysis of two biological samples, comprising:

reducing said EST set to a set of mRNA sequences;

analyzing said set of mRNA sequences to determine short mRNA sequences which maximally differentiate said mRNA sequences from mRNA sequences found in both biological samples; and

designing a DNA chip which detects said short mRNA sequences.

There is also provided in accordance with a preferred embodiment of the invention, a method of designing a DNA chip to detect relative expression levels of different variants of mRNA sequences having alternative spliced regions, comprising:

reducing an EST database to determine an mRNA sequence having alternative spliced regions;

enumerating short DNA sequences which are only included in the alternative spliced regions of said different variants; and

designing a DNA chip which detects said short DNA sequences.

There is further provided in accordance with a preferred embodiment of the invention a DNA chip constructed based on the above design methods.

There is also provided in accordance with a preferred embodiment of the invention, an mRNA sequence comprising at least two alternative spliced variants, for a single tissue type. Preferably, the sequence comprises at least three alternative variants. Preferably, the sequence comprises at least four alternative variants.

There is also provided in accordance with a preferred embodiment of the invention, an mRNA sequence comprising at least three alternative spliced regions. Preferably, the sequence comprises at least four alternative spliced regions. Preferably, the sequence comprises at least five alternative spliced regions. Alternatively or additionally, the mRNA sequence comprises different variants including mutually exclusive regions.

There is also provided in accordance with a preferred embodiment of the invention, a method of designing a DNA chip, comprising:

indexing an mRNA database to determine the indexing of short DNA sequences in the mRNA database, which short DNA sequences are of a length suitable for detection by a DNA chip;

determining from said indexing a set of short DNA sequences which uniquely identify a desired mRNA sequence; and

designing a DNA chip which detects said set of short DNA sequences.
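The indexing and selection steps of this design method can be sketched in Python as follows. The sketch assumes the simplest possible notion of "uniquely identify", namely a short sequence that occurs in exactly one mRNA of the database; the function name unique_probes, the identifiers m1 and m2, and the very short toy sequences are invented for illustration, and a practical probe length would be set by the design constraints of the chip.

```python
from collections import defaultdict

def unique_probes(mrnas, k):
    """For each mRNA, list the k-base sequences found in that mRNA and no other."""
    # Index: short sequence -> set of mRNA names containing it.
    index = defaultdict(set)
    for name, seq in mrnas.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)

    probes = defaultdict(list)
    for short_seq, names in index.items():
        if len(names) == 1:
            probes[next(iter(names))].append(short_seq)
    return {name: sorted(seqs) for name, seqs in probes.items()}

mrnas = {"m1": "ACGTAC", "m2": "CGTACG"}
probes = unique_probes(mrnas, k=4)
```

Here CGTA and GTAC occur in both toy sequences and are therefore rejected, leaving one distinguishing short sequence per mRNA.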

There is also provided in accordance with a preferred embodiment of the invention an mRNA sequence substantially as described and shown in mRNA transcripts included in the instant application.

There is also provided in accordance with a preferred embodiment of the invention an mRNA sequence having alternative spliced variants, substantially as described and shown in the instant application.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be more clearly understood from the detailed description of the preferred embodiments with reference to the accompanying figures, in which:

FIG. 1 is a flowchart of a method for generating mRNA sequences from a database of ESTs, in accordance with a preferred embodiment of the invention;

FIG. 2 is a schematic illustration of ESTs clustered into a number of exclusive clusters;

FIG. 3 is a flowchart of a method for EST clustering, in accordance with a preferred embodiment of the invention;

FIG. 4 is a schematic illustration of a partial matching between two ESTs, in accordance with a preferred embodiment of the invention;

FIG. 5 is a schematic illustration of an index of an EST database, by n-groups, in accordance with a preferred embodiment of the invention;

FIG. 6 is a flowchart of a method of assembling clustered ESTs, in accordance with a preferred embodiment of the invention;

FIG. 7 is a schematic illustration of matched ESTs; and

FIG. 8 is an illustration of a graph corresponding to the ESTs of FIG. 7, in accordance with a preferred embodiment of the invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

FIG. 1 is a flowchart of a method for generating mRNA sequences from a database of ESTs, in accordance with a preferred embodiment of the invention. The databases usually contain raw data and may be provided either from an existing EST database or by analyzing a particular cell to determine ESTs therefrom, possibly using a method well known in the art. Thus, the first step is preferably automatic correction of obvious errors. It should, however, be noted that the error correction may be performed at additional or alternative steps in the process.

Error Identification and Correction

It should be noted that some errors, which may seem obvious, may turnout, at a later stage, not to be errors at all. Thus, an importantaspect of error correction is preferably error identification. Further,in some cases, errors will be identified at an earlier stage andcorrected only at a later stage (when they are confirmed). In apreferred embodiment of the invention, any errors that are corrected arepreferably marked, so that such corrections may be automatically undone,if needed.

There are two types of error identification schemes which are preferablyused:

(One) Analysis of the original readout data (trace information) toprovide the probability of correct identification of bases. Thisinformation may be stored in the database. In some cases theprobabilities of error in bases identification is dependent on theequipment used to read the ESTs. In a preferred embodiment of theinvention, error probabilities are assigned based on the type ofequipment and/or other characteristics of the readout process. Inanother preferred embodiment of the invention, error probabilities aredetermined and/or updated by analyzing the type and distribution oferrors in mRNA sequences which were identified in an earlier iterationof the process. In a preferred embodiment of the invention, such anearlier iteration is limited to assembling mRNA sequences which areknown to be in the raw database. Thus, when the ESTs associated with themRNA sequence are found, the distribution of different types of errorscan be determined by comparing the ESTs with the known correct mRNAsequence. Such a limited iteration may use EST-to-database matchingtechniques, which are known in the art or it may use a subset of thetechniques described herein.

(Two) Analysis of the ESTs to detect suspicious portions, for example multiple repeats of single bases or short sequences.

There are several types of errors that are preferably corrected at this stage:

(a) Extra strings of “A” type bases. During the process of maturation of the mRNA, a long string of “A” bases is usually attached to the mRNA. Although these strings are generally automatically removed, some such strings may remain as contaminants and disrupt the assembly of mRNA sequences.

(b) Host DNA. The sequence of DNA just prior to and just after the mRNA, in the host cell, is well known, so a contamination by such a sequence can be easily detected and removed.

(c) Insertions and deletions. Missing and extra bases can usually be detected, since they render the rest of the mRNA sequence nonsense. mRNA, unlike DNA, does not usually contain nonsense segments and especially not series of stop codons. In some cases the missing base can also be guessed at, or the range of possibilities narrowed, by presuming that the resulting codon must be for an amino acid. Alternatively, a blank codon is inserted and the correct codon is determined during the assembly step, described below. Further alternatively, no codon is inserted, but the error is preferably noted.
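By way of illustration only, correction type (a) combined with the marking of corrections so that they may be undone might be sketched as follows. The function names and the minimum run length of 10 bases are assumptions made for the example and are not taken from the described embodiment.

```python
import re

def trim_poly_a(seq, min_run=10):
    """Strip a trailing run of at least min_run 'A' bases.

    Returns (trimmed_sequence, removed_tail); the removed tail is kept
    so that the correction is marked and can be undone later.
    min_run is a hypothetical threshold chosen for this sketch.
    """
    match = re.search(r"A{%d,}$" % min_run, seq)
    if match is None:
        return seq, ""
    return seq[:match.start()], match.group(0)

def undo_trim(trimmed, removed_tail):
    """Restore the sequence as it was before trimming."""
    return trimmed + removed_tail

est = "GATTACAGGCT" + "A" * 15
trimmed, removed = trim_poly_a(est)
```

Keeping the removed tail alongside the trimmed sequence is one simple way of recording a correction so that a later stage can reverse it.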

Clustering

After the ESTs are corrected for obvious errors, they are clustered into groups, each group supposedly containing ESTs from only a single gene. FIG. 2 is a schematic illustration of ESTs (indicated as short lines) clustered into three clusters 20, 22 and 24. In a typical raw database, containing over a million ESTs, the number of expected clusters is about 20,000-50,000.

FIG. 3 is a flowchart of a method for EST clustering, in accordance with a preferred embodiment of the invention. Rather than attempt to analyze the entire database in one go, the database is divided into component databases (some of which will usually overlap). Each such component database is preferably generated by indexing the entire database, as described in more detail below. Matching of ESTs is then performed in each of these component databases. At a later stage, the analyzed component databases, significantly reduced in size, may be merged together. The computational complexity of the matching task with the merged database is thus substantially reduced over that with the whole original database. It should be noted that the division into component databases is preferably strongly related to the EST matching, so that dividing up the database does not adversely affect the completeness of EST matching in the database as a whole.

FIG. 4 is a schematic illustration of a partial matching between two ESTs 26 and 28. A segment 30 in EST 26 matches a segment 34 in EST 28 and a segment 32 in EST 26 matches a segment 36 in EST 28. Instead of trying to match long segments of EST 26 and EST 28, in accordance with a preferred embodiment of the invention, only short segments are matched. It should be appreciated that the error rate in EST sequences is several percent; thus, the longer the segment matched, the higher the chance of missing a proper match, due to errors in the ESTs. However, when shorter segments are used, the number of matches between unrelated ESTs increases. One aspect of some embodiments of the present invention solves this problem by iteratively applying a matching process. Preferably, the number of bases in the segment matched is 9 (an n-group). However, this number is preferably a parameter dependent, inter alia, on the database size. A group of 9 bases is typically suitable for a database of several hundreds of thousands of ESTs, each EST being between 200 and 600 bases long. For other sizes of databases, other values may be used, preferably in the range 5-20, more preferably in the range 7-11.

In a preferred embodiment of the invention, the first step of clustering comprises generating an index of all the n-groups in the EST database. FIG. 5 is a schematic illustration of an index of an EST database, by n-groups, in accordance with a preferred embodiment of the invention. Each n-group has associated therewith a list of all the ESTs that contain that n-group anywhere in the EST (not only on boundaries of triplets). Each of these lists defines a database, within which all the ESTs may be related. There is, of course, the possibility of two related ESTs not being in the same list. Also, two ESTs might have a matching n-group even if they are completely unrelated. In addition, the same EST will probably appear in a very large number of lists, generally a monotonic increasing function of the length of the EST. The location of the n-group in the EST is also associated with each element of the list. If an n-group appears more than once in an EST, it is preferably entered in the list several times, each time with a different associated location of the n-group.
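A minimal sketch of such an index is given below, assuming that the ESTs are held as plain strings keyed by a hypothetical identifier. The function name and the toy sequences are illustrative only.

```python
from collections import defaultdict

def build_ngroup_index(ests, n=9):
    """Map every n-group to a list of (EST id, position) entries.

    Every offset is indexed, not only codon boundaries, and an n-group
    that occurs several times in one EST is entered once per location.
    """
    index = defaultdict(list)
    for est_id, seq in ests.items():
        for pos in range(len(seq) - n + 1):
            index[seq[pos:pos + n]].append((est_id, pos))
    return index

# Toy data (hypothetical sequences, far shorter than real ESTs).
ests = {"est1": "ACGTACGTACGTA", "est2": "TTACGTACGTAGG"}
index = build_ngroup_index(ests, n=9)
```

In this toy data the n-group ACGTACGTA is listed twice for est1 (at positions 0 and 4) and once for est2 (at position 2), which mirrors the treatment of repeated n-groups described above.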

Each of the lists of a common n-group can, as described above, be treated as an individual database, with any type of EST matching method known in the art performed thereon. However, in a preferred embodiment of the invention, especially when the list is very long, the above-described method of indexing is reapplied. The resulting second-order lists contain ESTs in which two n-groups match. Preferably, the process is repeated until there are four matching n-groups in each list. In a preferred embodiment of the invention, the reindexing is performed by intersection between the list of the common n-group and lists corresponding to other n-groups. The resulting lists may be used as seed clusters. The number of matching n-groups is also a parameter which may depend, inter alia, on database size and error distributions.
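The effect of repeated re-indexing can be approximated, for illustration, by counting how many distinct n-groups each pair of ESTs shares and keeping the pairs that reach the required number. This pair counting is a simplification substituted for the list intersection described above; it ignores n-group order and spacing. The function names are hypothetical and the example uses n=4 only to keep the sequences short.

```python
from collections import defaultdict
from itertools import combinations

def shared_ngroup_counts(ests, n=9):
    """Count, for every pair of ESTs, the distinct n-groups they share."""
    members = defaultdict(set)  # n-group -> ids of ESTs containing it
    for est_id, seq in ests.items():
        for pos in range(len(seq) - n + 1):
            members[seq[pos:pos + n]].add(est_id)
    counts = defaultdict(int)
    for est_ids in members.values():
        for pair in combinations(sorted(est_ids), 2):
            counts[pair] += 1
    return counts

def seed_pairs(ests, n=9, min_shared=4):
    """Pairs of ESTs sharing at least min_shared n-groups."""
    counts = shared_ngroup_counts(ests, n)
    return sorted(pair for pair, c in counts.items() if c >= min_shared)

# Toy data: "a" and "b" overlap by nine bases, "c" is unrelated.
ests = {"a": "ACGTTGCAAGGC", "b": "TTGCAAGGCTTA", "c": "GGGGCCCCTTTT"}
pairs = seed_pairs(ests, n=4, min_shared=4)
```

Counting every pair in a long list is quadratic in the list length, which is one reason the text prefers to start from short lists and to ignore very common n-groups.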

In a preferred embodiment of the invention, at least some of the re-applications of the indexing method include adding additional limitations. One type of limitation is requiring that the order of the matched n-groups be the same in matched ESTs. Another type of limitation is that at least some of the n-groups must be distanced by a minimum number of bases from other n-groups, so that a larger overlapping segment between ESTs is required for them to match. Yet another type of limitation is that the matched n-groups be substantially consecutive.

In this last type of limitation, rather than re-indexing all the n-groups in the ESTs, only the n-groups which are consecutive with, or distanced by a small number of bases, such as 1 or 2, from the common n-group, are indexed. Of course, other distances between the n-groups, such as distances smaller than 20 and more preferably, smaller than 10, may also be used. By requiring such a short distance between consecutive matching n-groups, an effective 18-group is formed and a match between two ESTs implies a match of 18 consecutive bases (27 and 36 in later iterations). However, by allowing 0, 1 or 2 bases to appear between the n-groups, small insertions and deletions of bases may be overcome. In addition, using such short matching sequences allows even rather short ESTs, such as EST fragments, to partake in the clustering process.

It should be appreciated that some of the index lists are longer than others. In some animal species, the occurrence of some n-groups is more common than in others. In addition, due to statistical considerations, some n-groups will be more common than others. It should be noted that if an n-group is too common, the number of correct associations between ESTs using that n-group will be significantly lower than the number of incorrect ones. This is especially true for poly-A sequences and for repetitive DNA sequences. In a preferred embodiment of the invention, clustering is started from the shorter lists, i.e., those which correspond to the less common sequences. Preferably, once all, or most, of the ESTs are clustered, the clustering is stopped. Alternatively or additionally, lists containing more than a certain percentage of ESTs are ignored. Thus, not all the component databases need to be processed. This percentage is preferably a parameter, preferably dependent on the database size and on the type of distribution of the n-groups. In a preferred embodiment of the invention, a database of n-groups less preferred for matching is maintained and, if possible, lists corresponding to these n-groups are not indexed and/or are ignored. Alternatively or additionally, the more common n-groups are not indexed at all. In a preferred embodiment of the invention, the relative distribution of n-groups is determined by indexing a statistically significant sample of the EST database.
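The ordering and filtering of index lists might be sketched as follows. The function name, the shape of the index (n-group mapped to a list of EST identifiers) and the cut-off value are assumptions for the example; the text leaves the percentage as a tunable parameter.

```python
def lists_to_process(index, total_ests, max_fraction=0.05):
    """Return n-groups ordered from shortest list to longest.

    Lists covering more than max_fraction of all ESTs are dropped,
    since very common n-groups mostly produce incorrect associations.
    max_fraction is a hypothetical default, not a value from the text.
    """
    kept = {
        ngroup: ids
        for ngroup, ids in index.items()
        if len(set(ids)) / total_ests <= max_fraction
    }
    return sorted(kept, key=lambda ngroup: len(kept[ngroup]))

# Toy index over four ESTs; the poly-A n-group appears in all of them.
index = {
    "AAAAAAAAA": ["e1", "e2", "e3", "e4"],
    "GGCATTCAG": ["e3", "e4", "e1"],
    "ACGTACGTA": ["e1", "e2"],
}
order = lists_to_process(index, total_ests=4, max_fraction=0.75)
```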

In a preferred embodiment of the invention, n-groups of portions of ESTs, which include errors, and/or n-groups of portions of ESTs, which include corrected errors, do not participate in the indexing. Preferably, once the errors are corrected, these n-groups are indexed and the clustering is updated. Alternatively, such n-groups are graded with a lower grade than (supposedly) error-free n-groups. The decision whether to associate an EST with a cluster may be made based on the grade. Further, even if no errors are detected, some n-groups may be assigned a lower grade than other n-groups, for example n-groups of consecutive bases of a single type. Still further, a particular EST may be assigned a lower grade than other ESTs due to problems which occur during the reading of the EST. Matching this EST to a cluster will preferably require a higher definiteness of matching, such as requiring five matching n-groups instead of four.

Once the ESTs are grouped into seed clusters, the clusters are preferably merged into large clusters. In a preferred embodiment of the invention, a Union-Find algorithm, which is known to have a low computational complexity, is used to perform the merge. Since the same EST can appear in more than one seed cluster, any two clusters which include the same EST may be merged. In addition, sometimes two ESTs are known to be from the same mRNA sequence, for example, when they are read out from opposite sides of the same mRNA. In this case, clusters containing these ESTs may also be merged. In some cases, two clusters will not be merged, based on feedback from a later stage in the processing. One example of such feedback is the identification of a common EST or of a matched portion of an EST as a chimeric segment. In a preferred embodiment of the invention, an existing cluster may be split apart and the assembly thereof repeated. In this embodiment, the history of the matches which merged the cluster is preferably saved, to facilitate splitting it. Alternatively, the cluster may be split by identifying the incorrect matches and then splitting the cluster based on the remaining matches.
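For illustration, merging seed clusters that share an EST can be sketched with a small Union-Find structure, here with path halving and without the match history or the splitting described above. The function name and the EST identifiers are hypothetical.

```python
def merge_seed_clusters(seed_clusters):
    """Merge seed clusters that share at least one EST (Union-Find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for cluster in seed_clusters:
        members = list(cluster)
        if not members:
            continue
        find(members[0])  # registers single-EST clusters as well
        for other in members[1:]:
            union(members[0], other)

    groups = {}
    for est in parent:
        groups.setdefault(find(est), set()).add(est)
    return sorted(sorted(group) for group in groups.values())

seeds = [{"e1", "e2"}, {"e2", "e3"}, {"e4", "e5"}, {"e6"}]
merged = merge_seed_clusters(seeds)
```

Pairs of ESTs known from other information to belong together, such as reads from opposite ends of one clone, could be passed in as additional two-member seed clusters.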

In a preferred embodiment of the invention, the correspondence between the ESTs is used as a starting point for the assembly step, described next.

Assembly

FIG. 6 is a flowchart of a method of assembling clustered ESTs, in accordance with a preferred embodiment of the invention. First, the ESTs are arranged with corresponding segments of the ESTs identified. In the clustering step of the algorithm, two ESTs were associated if they had four consecutive corresponding n-groups. When the ESTs are arranged for assembly, it is expected that the corresponding segments be substantially longer.

In accordance with one preferred embodiment of the invention, the segments are identified and matched using a standard algorithm such as a correlation algorithm, a BLAST algorithm, a FASTA algorithm or an SW (Smith-Waterman) algorithm. Alternatively, the matching of ESTs is performed by expanding the matching of the n-groups to adjacent bases, until each segment of each EST is either matched to a corresponding segment or determined to be unmatched. Typically, there will be some vagueness regarding the exact extent of the segments, especially at the ends of ESTs. This may be due to missing bases at the ends of the ESTs. In addition, there is not usually an exact match between two segments due to errors in the ESTs. These types of errors are preferably corrected as described below.

In a preferred embodiment of the invention, an identified segment is split (and the split propagated to other ESTs where the segment has been identified) when the segment matches only a part of a corresponding segment in a different EST. Preferably, the correlation level at which a segment is split is a parameter of the system.

One problem with best match correlations between two ESTs is that similar ESTs can originate from different, yet homologous genes. In a preferred embodiment of the invention, two segment matching algorithms are used to align the ESTs, one algorithm which attempts to detect that two ESTs are from homologous genes and one which attempts to detect that the ESTs are from the same gene. Alternatively, a single algorithm is used, which generates a probability of two ESTs being from the same gene, from homologous genes or unrelated. One example of algorithms which attempt to detect that two ESTs are from homologous genes is the GeneWise family of algorithms. The previously described correlation, BLAST, FASTA and SW algorithms attempt to detect that two ESTs are from the same gene.

In a preferred embodiment of the invention, a modified SW type algorithm is used to generate correspondences between ESTs. In a regular SW algorithm, a penalty is attached for each missing or extra base. In accordance with a preferred embodiment of the invention, the following grading scheme is used, which includes a new situation, “long gap”:

First Gap: 12 penalty points

Following Gaps: 4 penalty points

Match: 4 bonus points

Mismatch: 9 penalty points

Long Gap: 50 penalty points

Thus, long gaps, which correspond to alternative spliced regions, incur a large penalty; however, they do not generate as many penalty points as under the unmodified SW algorithm.
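The grading scheme can be illustrated by scoring an alignment that has already been produced. The text does not state at what length a gap becomes a “long gap” or how the 50 points combine with the per-base gap penalties; the sketch below assumes, as one possible reading, that the penalty for a gap is simply capped at 50 points. The function names are hypothetical, and a gap run that switches from one sequence to the other is treated as a single gap for simplicity.

```python
FIRST_GAP = 12      # penalty for the first base of a gap
FOLLOWING_GAP = 4   # penalty for each further base of the same gap
MATCH = 4           # bonus for a matching base
MISMATCH = 9        # penalty for a mismatching base
LONG_GAP = 50       # assumed here to act as a cap on any single gap

def gap_penalty(length):
    """Penalty for one gap: affine cost, capped at LONG_GAP (assumption)."""
    if length <= 0:
        return 0
    affine = FIRST_GAP + FOLLOWING_GAP * (length - 1)
    return min(affine, LONG_GAP)

def score_alignment(a, b):
    """Score two equal-length aligned strings; '-' marks a gap position."""
    score = 0
    gap_len = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            gap_len += 1
            continue
        score -= gap_penalty(gap_len)  # close any gap that just ended
        gap_len = 0
        score += MATCH if x == y else -MISMATCH
    return score - gap_penalty(gap_len)
```

Under this reading a gap of 200 bases costs 50 points, whereas the uncapped scheme would charge 12 + 4 × 199 = 808 points, so two ESTs that differ only by an alternatively spliced region can still be aligned.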

In one preferred embodiment of the invention, generating the correspondence between EST pairs is performed as part of the clustering.

FIG. 7 is a simplified schematic illustration of four ESTs 40, 42, 44 and 46 arranged to show correspondence between segments thereof. In a typical database, there will be many more than four ESTs for each mRNA sequence. Before actually assembling an mRNA sequence, the overlap between the segments of ESTs is preferably used to correct errors in the ESTs. The corrections performed preferably include:

(a) replacing single bases with other type bases, preferably based on a voting algorithm between all the corresponding segments; and

(b) correction of insertions or deletions of single bases, or a small number thereof, preferably based on a voting algorithm.
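Both corrections can be illustrated by a column-wise majority vote over segments that have already been aligned to a common length. The representation of an absent base by '-' and the function name are assumptions of this sketch; tie-breaking and quality weighting are not addressed.

```python
from collections import Counter

def consensus_by_vote(aligned_segments):
    """Majority vote per alignment column.

    aligned_segments: equal-length strings in which '-' marks a position
    where that EST has no base. A column whose winner is '-' is treated
    as an insertion error in the minority ESTs and is dropped.
    """
    consensus = []
    for column in zip(*aligned_segments):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)

segments = [
    "ACGTTA-GC",
    "ACGATA-GC",  # substituted base in the fourth column
    "ACGTTA-GC",
    "ACGTTATGC",  # extra base in the seventh column
    "ACGTT--GC",  # missing base in the sixth column
]
corrected = consensus_by_vote(segments)
```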

Once the corresponding segments are corrected, a directed graph is built to represent the cluster. In this graph, each node represents a single segment and the allowed transitions between nodes are exactly those transitions which correspond to two segments being consecutive. Alternatively, some or all of these corrections are performed only after the graph is generated and/or analyzed to correct errors. In a preferred embodiment of the invention, two nodes in which the origin node has only one exit and the ending node has only one input are collapsed into a single node, to simplify the resulting graph. In a normal situation, with no alternative splices, a single node, which represents the consensus of ESTs, may suffice for describing an mRNA sequence. In a preferred embodiment of the invention, the graph is built incrementally by adding the effects of ESTs to the graph, on a one by one basis. A new EST will generally modify an existing graph by adding a new segment or by bridging two existing (possibly unconnected) segments. For example, two ESTs might be known to be associated because they are from opposite ends of a single mRNA sequence. However, until the gap between the sequences is bridged by one or more ESTs, it is not possible to determine their exact correspondence. Preferably, the graph is stored with the reduced database to facilitate adding further ESTs to the graph and/or database later.

FIG. 8 is an illustration of a graph corresponding to an assembly of ESTs 40, 42, 44 and 46 of FIG. 7, in accordance with a preferred embodiment of the invention.

An exemplary process of building the graph of FIG. 8 is as follows:

(1) a graph having a single node A is generated, where node A corresponds to segment A of EST 40;

(2) a new node B is generated for segment B which is common to ESTs 40 and 42;

(3) a transition between node A and node B is defined, based on their being consecutive in EST 40;

(4) a new node C is generated for segment C of EST 40;

(5) a transition between node B and node C is defined, based on their being consecutive in EST 40;

(6) a new node D is generated for segment D which is common to ESTs 40 and 42;

(7) a transition between nodes C and D is defined, based on their being consecutive in EST 40;

(8) a transition between nodes B and D is defined, based on their being consecutive in EST 42;

(9) a new node E is generated for segment E which is common to ESTs 40, 42 and 46;

(10) a transition between nodes D and E is generated, based on their being consecutive in EST 40 and EST 42;

(11) a new node F is generated for segment F which is common in ESTs 40, 42 and 46;

(12) a transition between nodes E and F is generated, based on their being consecutive in EST 40 and EST 42;

(13) a new node G is generated for segment G of EST 46;

(14) a transition between nodes E and G is defined based on EST 46;

(15) a transition between nodes G and F is defined based on EST 46;

(16) a new node H is generated for segment H which is common to ESTs 42 and 46;

(17) a transition between nodes F and H is defined based on EST 42 and EST 46;

(18) a new node I is generated for segment I in EST 46;

(19) a transition between nodes H and I is defined based on EST 46;

(20) a new node J is generated for segment J which is found only in EST 42;

(21) a transition between nodes H and J is defined based on EST 42;

(22) a new node K is generated for segment K in ESTs 42 and 44;

(23) a transition between nodes J and K is defined based on EST 42;

(24) a transition between nodes K and I is defined based on ESTs 42 and 44;

(25) a new node L is generated for segment L in ESTs 42 and 44;

(26) a transition between nodes I and L is defined based on ESTs 42 and 44;

(27) a new node M is generated for segment M in EST 42; and

(28) a transition between nodes L and M is defined based on EST 44.
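The result of steps (1) to (28) can be sketched as an adjacency list holding the fifteen transitions listed above, from which candidate mRNA configurations are read off as paths from node A to node M. The function names are hypothetical. Enumerating every path treats the branch points as independent of one another, which is an assumption of the sketch and not something the ESTs establish.

```python
# Transitions defined in steps (3) to (28), in order of appearance.
TRANSITIONS = [
    ("A", "B"), ("B", "C"), ("C", "D"), ("B", "D"), ("D", "E"),
    ("E", "F"), ("E", "G"), ("G", "F"), ("F", "H"), ("H", "I"),
    ("H", "J"), ("J", "K"), ("K", "I"), ("I", "L"), ("L", "M"),
]

def build_graph(transitions):
    """Adjacency list: node -> list of nodes reachable in one step."""
    graph = {}
    for src, dst in transitions:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    return graph

def enumerate_paths(graph, start, end):
    """All paths from start to end; the graph is assumed to be acyclic."""
    if start == end:
        return [[end]]
    paths = []
    for nxt in graph[start]:
        for tail in enumerate_paths(graph, nxt, end):
            paths.append([start] + tail)
    return paths

graph = build_graph(TRANSITIONS)
variants = ["".join(path) for path in enumerate_paths(graph, "A", "M")]
```

With three two-way branch points the sketch yields eight candidate configurations, ranging from ABDEFHILM to ABCDEGFHJKILM.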

As previously mentioned, some of the nodes may represent very short segments. The minimum length of segment is preferably a parameter. In a preferred embodiment of the invention, such short segments are either ignored (dropped, as a type of error correction) and/or attached to an adjacent node. One example in FIG. 8 would be if segment J was such a short segment. In addition, the final mRNA sequence preferably includes the end UTRs.

In a preferred embodiment of the invention, each node stores the original segments of the ESTs, their correspondence and/or any error correction performed thereon. Preferably, each node also stores a representation of the mRNA sequence which is created by merging these segments. Thus, when matching an EST to a node, the EST can be matched to any of the original ESTs and to the error corrected result of their combination.

The graph in FIG. 8 indicates that the mRNA sequence contains three alternative spliced regions: C, G and JK. However, it should be appreciated that from the information contained in the ESTs, there is no clear indication whether each of these alternative spliced regions is independent from each other. In a preferred embodiment of the invention the graph includes information which limits transitions based on previously selected transitions, responsive to dependencies between alternative spliced variants found in the EST database. In a preferred embodiment of the invention, either a “closed world assumption” or an “open world assumption” is used to decide whether a certain transition is allowed, based on the types of transitions found in the ESTs.

In accordance with a preferred embodiment of the invention, the resulting graph is analyzed to detect errors in the reading out of the mRNA. As described above, alternative spliced regions are clearly indicated by cycles in the graph. In some cases, where the alternative spliced region is at the end of the graph, the graph may have two ending points. An ending point may also have a transition to other nodes (for example node L in the example of FIG. 8) if a very large number of ESTs end at segment L (even without a stop codon) and only a small number continue. Such a case may also indicate that segment M is an artifact or a rare alternative splice. The length of segment M and the number and the average overlap between ESTs may be used to determine whether segment M is an artifact. Preferably, such determination is based on statistics gleaned from other mRNA sequences in the database or in similar databases.

Chimeric sequences usually create artifacts in the graph. For example, if the resulting graph comprises a first start node and a first end node connected by a single transition and a second start node and a second end node also connected by a single transition, this graph corresponds to two mRNA sequences. However, if there is an extra transition between the first start node and the second end node, this transition may be indicative of a chimeric sequence. This suspicion becomes a near certainty if this transition is supported by only a single EST.

In accordance with a preferred embodiment of the invention, the following algorithm is used to generate a graph from ESTs. This algorithm assumes that a graph already exists and that a new EST is to be added to it. The first EST will generate a graph having a single node, which node will include a segment corresponding to the entire EST. It should be noted that this algorithm does not require that all the nodes of the graph be connected. The algorithm is:

(a) Match the new EST to the segments stored in each node of the graph.

(b) If the EST matches a node or a sequence represented by a contiguous series of nodes, the EST is merged with the nodes that it corresponds to.

(c) If a portion of the EST does not match, a new node is created for that portion. If the new node is at one of the ends of the graph, it may be incorporated in an existing node and used to extend the length of the segment represented by the node.

(d) If there is an extra or missing portion in the EST, which corresponds to the middle of a segment in an existing node, the existing node is split, a new node is generated for the non-matching portion and transitions are created between the two parts of the original node and between the new node and those parts of the existing node which match portions of the EST.

(e) A new transition is created between two existing nodes if the EST contains a contiguous sequence, the ends of which match portions of the two existing nodes.

In a preferred embodiment of the invention, a grammar is used to describe the gene sequences, where each token corresponds to a base or a segment, instead of or in addition to a graph type representation. For example, an LR(1) grammar or an LALR(1) grammar may be used. In such a case, the errors in the gene sequences are preferably determined by applying grammar matching rules to the gene sequences. Preferably, each grammar matching rule is associated with a probability of a particular type of error. One example is using LEX, YACC or other lexical and/or grammar programs, known in the art.

It should be noted that, by using the graph type representation, more than one configuration of mRNA sequence, each of which corresponds to a different alternative spliced variant of a single mRNA sequence, can be detected in a single tissue type. Further, other types of mRNA variants, such as those caused by mutated genes and/or by other causes can also be detected. In addition, the ratio between two (or more) alternative spliced variants is preferably determined by counting the number of ESTs associated with each mRNA configuration and/or by comparing their expression levels. It should be noted that even three, four, five or more alternative spliced variants can be simultaneously determined using the above described method. Further, such configurations can have one, two, three, four or more alternative spliced regions which are not the same in different variants. The determined alternative spliced variants may be represented in graph form or as a regular expression or as a grammar rule. Alternatively, alternative spliced mRNA sequences are stored as a set of mRNA sequences, each of which corresponds to a single variant. Once an mRNA sequence is obtained, it is also possible to generate a real nucleotide sequence, using methods well known in the art.

In a preferred embodiment of the invention, the entire process of database reduction is repeated. ESTs which were not associated with any mRNA sequence may, at the second iteration, be associated with an mRNA sequence. Alternatively, ESTs which remain unmatched are discarded.

Protein Matching

After the mRNA sequences are generated it is useful to compare the protein which is encoded by the sequence to an existing protein database(s). Near matches may be used to determine errors in the mRNA sequences. Such errors are preferably corrected and their effect fed back to earlier steps in the algorithm. In addition, chimeric sequences may be discovered by such a comparison. Further, alternative splices in the mRNA or in the protein database may be determined by comparing the reduced database and the protein database. In such a case, the mRNA sequence and/or the protein database are preferably updated to include the newly discovered alternative splices. Near and exact matches are preferably associated with the reduced mRNA sequences for further use. In addition, some types of near matches and matches with protein families and/or domains can also be used to determine various functional characteristics of the protein. Further, if two clusters each match different parts of a known protein, these two clusters are preferably merged.

In a preferred embodiment of the invention, the graph of the mRNA sequence is compared as a whole to a protein database. Thus, ambiguous bases and codons can be resolved at a later date, when the graph is compared to a protein database.

In a preferred embodiment of the invention, data from different raw databases are combined by comparing the mRNA sequences and their associated proteins in the two databases, after reduction of each raw database, rather than by combining the two raw databases. Alternatively, two EST databases may be combined after clustering. Preferably, error correction is not propagated between two such databases. Alternatively, at least some of the error identification and/or correction are propagated between the databases. It should be noted that, using some of the methods described herein, it is possible to combine an EST database with an mRNA database, since ESTs are similar to mRNA sequences, only shorter. Further, some mRNA sequences have been determined from genomic databases, by identifying the introns.

In accordance with another preferred embodiment of the invention, the function of the protein encoded by the determined mRNA sequence is analyzed by finding known proteins with a similar structure. This type of analysis is especially useful to discover cells in non-human creatures which create proteins which are similar to human proteins. These types of cells are useful for studying the functioning of the protein.

Genome Matching

Additionally or alternatively to matching the mRNA sequences to protein databases, the mRNA sequences may be matched to genome database(s), to find the gene from which the mRNA was transcribed. This type of matching can serve several purposes. First, some types of errors can be corrected by comparing the mRNA sequence with the source DNA. Second, by comparing the mRNA with the source DNA, the mechanism of alternative splicing, especially for the particular mRNA sequence, may be determined. Third, this comparison can serve as a diagnostic tool by identifying critical mutations. For example, certain types of cancer are caused by mutations in the DNA. Such mutations will generally cause changes in the transcription of the mRNA, which can be determined by comparing the mRNA against a baseline database, such as the human genome project database.

Integration with Database Information

One aspect of some embodiments of the present invention relates to integrating information in the raw database with the process of clustering, assembly and error correction and/or identification. In accordance with one preferred embodiment of the invention, the clustering, assembly and/or error correction and identification are tested against the database information to determine the possibility of errors. Such errors are preferably fed back to earlier steps in the process. Alternatively or additionally, the raw database information is integrated into the above-described processes. One example of integration is that the probability of two ESTs belonging to the same cluster increases if they are from the same tissue type. Another example of integration is that if two ESTs have similar expression levels there is a higher probability that they are associated (and vice-versa).

In a preferred embodiment of the invention, each EST has associated therewith the probabilities of correct identification and/or of misidentification of bases in the particular EST during the EST readout. Alternatively or additionally, the raw database includes statistical ranges for acquisition parameters, from which probability of various error types may be determined.

In a preferred embodiment of the invention, the EST database preferably includes, inter alia, one or more of the following items of information: tissue type, cDNA library origin, clone name, expression level and chromosome association. Being from the same clone and/or from the same chromosome preferably increases the probability of two ESTs belonging to the same cluster.

Variations and Caveats

In some cases, an EST which exists in the database is the complement of a correct EST. In a preferred embodiment of the invention, ESTs are entered both in their original form and as their complements. Alternatively, each time two ESTs are matched or other reference is made to the base sequences of the ESTs, provision is made for the complement, such as by generating an index of both the original EST and its complement.
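Entering each EST in both forms might be sketched as follows. The sketch assumes that “complement” refers to the reverse complement, i.e., the opposite strand read in the conventional direction; that reading, the function names and the '+'/'-' strand labels are assumptions of the example.

```python
# Translation table for complementing bases; N (unknown) maps to itself.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Complement every base and reverse the order of the bases."""
    return seq.translate(_COMPLEMENT)[::-1]

def both_strands(ests):
    """Return each EST under (id, '+') as given and (id, '-') reversed."""
    entries = {}
    for est_id, seq in ests.items():
        entries[(est_id, "+")] = seq
        entries[(est_id, "-")] = reverse_complement(seq)
    return entries

entries = both_strands({"est1": "ATGCCGTA"})
```

Indexing both entries doubles the size of the index but allows two ESTs read from opposite strands to share n-groups.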

It should be appreciated that the above described methods can be applied directly to chromatograms. It should be noted that by directly analyzing the chromatograms it is possible to determine various error probabilities, such as the probability of mis-identifying a particular base.

It should be noted that even though this process is especially adapted for removing redundancy and generating mRNA sequences in an entire database, it may also be used to compare a single EST with a database. In a preferred embodiment of the invention, the n-group index of the database is stored with the database, to facilitate such matching. In addition, such an indexed database is easier to combine with a second database, since the clustering is easier and faster to update.

One important issue in tuning the algorithm is determining how to apply limitations to the reduction process. If the reduction process is too strict, the process might fail, since the biological data involved has many errors. However, if the process is too lax, the reduction ratio will be small and many of the associations will be mistaken. In some preferred embodiments of the invention, most limitations are applied at later stages and then fed back to previous stages. Alternatively, a more strict process may be used, where if the results are not sufficient, the feedback to earlier steps makes them less strict, by removing previously applied limitations. Alternatively, a more lenient process may be used, where if the results are not sufficient, the feedback to earlier steps makes them more strict, by adding limitations.

There are many places in the above described database reduction method where there is a wide latitude for applying limitations, for example, by changing decision parameters. A most important such place is in associating an EST with a cluster. Rather than a binary description, it is possible to use a fuzzy-logic type description of an EST belonging to a cluster. Additionally or alternatively, a grade may be assigned to each association of an EST with a cluster, based on feedback from other steps and/or on database information, such as tissue type and/or expression level. In one example, the grade of a matching of an EST to a cluster is dependent on the number of matching n-groups and on the required spacing between the n-groups in the particular EST.
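The last example above may be illustrated by the following sketch, which is not taken from appendix A. It assumes one possible grading formula: a term that grows with the number of matching n-groups, multiplied by the fraction of matches whose spacing in the EST agrees with their spacing in the cluster. The function name `association_grade` and the parameters `max_gap_error` and `saturation` are hypothetical.

```python
from typing import List, Tuple


def association_grade(matches: List[Tuple[int, int]],
                      max_gap_error: int = 2,
                      saturation: int = 5) -> float:
    """Grade (0.0 to 1.0) the association of an EST with a cluster.

    matches: (position in EST, position in cluster sequence) for each
    n-group shared by the two, in any order.
    """
    if not matches:
        return 0.0
    ordered = sorted(matches)
    # The first match has no predecessor, so it counts as consistent.
    consistent = 1
    for (e0, c0), (e1, c1) in zip(ordered, ordered[1:]):
        # Spacing is consistent when the gap between two neighbouring
        # n-groups is (nearly) the same in the EST and in the cluster.
        if abs((e1 - e0) - (c1 - c0)) <= max_gap_error:
            consistent += 1
    count_term = min(len(ordered), saturation) / saturation
    spacing_term = consistent / len(ordered)
    return count_term * spacing_term
```

A grade of this kind replaces the binary "belongs / does not belong" decision with a value that later stages can threshold or revise.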

It should be appreciated that every time a correction is made to the ESTs, there is a possibility that the correction should not be made or that a different correction should be made. In a preferred embodiment of the invention, an expert system is used to decide whether to apply a correction. One input to such a system is the a priori probabilities of different types of errors and/or of possible corrections to certain types of errors. Another input is feedback from later stages of the algorithm, at which an a posteriori probability for these errors can be generated, for a particular database and/or EST. In addition, the feedback can contain information about the correctness of correcting particular errors. For example, at an earlier stage, a base is deemed to be missing from an EST, so a new base is inserted at a particular location. In a later stage, such as assembly, it is determined that the assumed location of the missing base was not correct. Not only can the proper correction be made, but, preferably, the clustering is performed again, since by adding or removing a base, the matching of n-groups may be changed.

It should be appreciated that performing the clustering again is not a difficult task, since, in most cases, the existing indexes need to be only slightly modified. In addition, it is possible to update only the clustering of the ESTs which had high grades of matching.

In a preferred embodiment of the invention, the output of the process includes a certainty value for each assembled mRNA, and, preferably, also for each correction on the assembled mRNA sequences. This value is preferably based on the above described grades and/or on a comparison with protein databases and genome databases.

It should be appreciated that the above described order of application of steps is not required for the operation of the reduction method; rather, the above described order is varied in preferred embodiments of the invention. In one example, error correction is applied after clustering. In another example, comparing to genome databases is performed before clustering, to assist clustering. Error correction in particular may be performed at many different times during the process, since information necessary to identify and/or correct errors is continually being collected and/or updated as the process progresses.

The above-described process has been described as a combination of many features, decision parameters and probabilities. It should be appreciated that not all of these above-described features, decision parameters and/or probabilities are utilized in all preferred embodiments of the invention.

In accordance with a preferred embodiment of the invention, the above described process is embodied in a general purpose computer running software. Such software may be provided on a computer readable medium, such as a diskette or a tape. Alternatively, some of the software is run on a Bioccelerator, available from Compugen Ltd. of Petach Tikva, Israel.

Attached herein as an appendix “A” is software suitable for performing some of the above-described preferred embodiments of the invention. It should be noted that this software is provided prior to integration thereof, so that various bugs in the program may exist. In particular, there are provided scripts which convert data from formats suitable for one module of the program to a format suitable for other modules. In the current state of the supplied software, the software is suitable for application as a set of tools, which may be manually applied to perform various steps of some of the above described methods.

An important aspect of using an automated algorithm to correct errors and/or sequence mRNA sequences from an EST database is the fine-tuning of various parameters therein. In particular, in the present algorithm, the size of n-groups, the number of n-group matches needed to associate two ESTs, the method of grading an association of an EST to a database, the probability level at which a base is assumed to be correctly identified, the weight placed on tissue type identification when associating ESTs and the allowed distance between two matching n-groups are all important parameters in various embodiments of the present invention. The values of these parameters are dependent, to some extent, on the size of the database, the average size of ESTs and the type and distributions of errors in the ESTs. Typically, the above-described methods are applied in order to minimize confusion in the database. As such, one important goal is to reduce redundancy, especially by associating two ESTs and merging them together into a single mRNA sequence. Another important goal is not to create an association between two ESTs which are not truly associated.

In accordance with a preferred embodiment of the invention, various parameters of the database are manually and/or automatically adjusted to maximize these goals. In one example, the set of n-groups to be ignored while indexing is automatically determined by analyzing the database. In another example, the distribution of errors is obtained by applying a first iteration of the method to the database or a portion thereof. Thus, in an iterative application, the first iteration may be considered a calibration run. In another example, the method is applied to a raw database with one set of values for various parameters. If the resulting compression and/or data quality are insufficient, the set of values is changed and the method is re-applied. Preferably, search methods, well known in the art, are used to guide the modification of the set of values.
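The re-application of the method with a changed set of values may be sketched as below. This is only an illustration under stated assumptions: the single tuned parameter is the n-group length, the candidate values are tried in a fixed order rather than by a guided search, the reduction ratio is taken to be the number of ESTs divided by the number of resulting clusters, and `toy_reduce` is a hypothetical stand-in for the full reduction process (it merges ESTs sharing any n-group and performs no error correction or assembly).

```python
from typing import List, Optional, Sequence, Tuple


def toy_reduce(ests: Sequence[str], n: int) -> int:
    """Stand-in reduction: return the number of clusters obtained when
    ESTs sharing at least one n-group are merged (single linkage)."""
    parent = list(range(len(ests)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # n-group -> index of the first EST containing it
    for i, seq in enumerate(ests):
        for k in range(len(seq) - n + 1):
            group = seq[k:k + n]
            if group in first_seen:
                parent[find(i)] = find(first_seen[group])
            else:
                first_seen[group] = i
    return len({find(i) for i in range(len(ests))})


def tune(ests: List[str], candidate_ns: Sequence[int],
         target_ratio: float) -> Optional[Tuple[int, float]]:
    """Try each n-group length in turn (strictest first is suggested) and
    return the first (n, ratio) whose reduction ratio reaches the target."""
    for n in candidate_ns:
        ratio = len(ests) / toy_reduce(ests, n)
        if ratio >= target_ratio:
            return n, ratio
    return None  # no candidate value achieved the target
```

For instance, `tune(["ACGTACGT", "GTACGTTT", "CCCCCCCC"], [8, 6, 4], 1.5)` rejects n = 8 (no two ESTs share an 8-group) and accepts n = 6, at which the first two ESTs merge.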

It should be appreciated that automatic calibration is especially important in a medical laboratory setting, where a large number of samples are to be analyzed by a single machine. In such settings, an expert who knows how to adjust the device will generally not be available. Further, a great variability between the samples, especially with regard to size and error types, may also be expected.

It should be appreciated that the above described process of database reduction comprises several distinct steps and ideas. Some of these ideas can be practiced in isolation from other ideas, in some preferred embodiments of the invention. Alternatively or additionally, various steps in the process may be replaced by equivalent steps which are known in the art, without affecting the spirit of some embodiments of the present invention.

It should be appreciated that some of the steps described above may be used as stand-alone modules for other tasks. For example, the method of indexing ESTs to facilitate clustering is also useful for searching DNA databases. In addition, application of set-operators, such as intersection, union and difference, on two indexed databases is much faster than on non-indexed databases. Typically, a first step in applying such operators is to identify which ESTs are similar in the two databases. A large percentage of the similar ESTs are indexed under similar n-groups. Thus, a reasonably good intersection between databases can be obtained by comparing only ESTs which are indexed using the same (or mostly the same, as some errors are to be expected) n-groups.
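The first step of such an intersection may be sketched as follows, as an illustration only. The sketch builds the two n-group indexes on the fly (in practice they would be stored with the databases) and reports pairs of sequences that share at least `min_shared` distinct n-groups; the names `index_n_groups`, `approximate_intersection` and `min_shared` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Set, Tuple


def index_n_groups(db: Dict[str, str], n: int) -> Dict[str, Set[str]]:
    """Map each n-group to the set of sequence ids that contain it."""
    index = defaultdict(set)
    for seq_id, seq in db.items():
        for pos in range(len(seq) - n + 1):
            index[seq[pos:pos + n]].add(seq_id)
    return index


def approximate_intersection(db_a: Dict[str, str], db_b: Dict[str, str],
                             n: int, min_shared: int = 2) -> Set[Tuple[str, str]]:
    """Return (id in A, id in B) pairs sharing at least min_shared n-groups.

    Only n-groups present in both indexes are visited, so sequences with
    nothing in common are never compared directly.
    """
    index_a = index_n_groups(db_a, n)
    index_b = index_n_groups(db_b, n)
    shared = Counter()
    for group in index_a.keys() & index_b.keys():
        for id_a in index_a[group]:
            for id_b in index_b[group]:
                shared[(id_a, id_b)] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}
```

Requiring several shared n-groups rather than all of them is one way of tolerating the "mostly the same" case, in which sequencing errors destroy some of the n-groups of an otherwise matching EST.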

Applications

In the present art of mRNA analysis, the portion of the task between obtaining the tissue sample and reading out the ESTs requires only time, patience and a skilled technician. However, once the ESTs are gathered, combining them to form correct mRNA sequences requires an expert with extensive experience and superior abilities. In addition, even once the mRNA sequences are determined, many are incomplete or incorrect since they do not take into account the possibility of alternative splicing.

In accordance with a preferred embodiment of the invention, there is no need for an expert individual to create the mRNA sequences. Rather, the generation of mRNA sequences from ESTs is made simpler and more automatic.

In accordance with a preferred embodiment of the invention, diseases are diagnosed by mRNA analysis, without the need for a highly qualified person to aid in mRNA assembly. A sample of tissue to be analyzed is removed from a patient's body. This tissue is then processed to produce ESTs. The ESTs are inputted into a device in accordance with a preferred embodiment of the invention and a spectrum of mRNA in the tissue sample is generated. This spectrum can be automatically compared to known spectra of diseases, such as cancer, to determine the existence and type of cancer. In addition, the spectrum can be analyzed to determine the instant function of the cells in the tissue sample (healthy cells, stressed cells and ill cells all express different proteins). Such analysis may be automatic, by comparing the measured spectrum against a standard and/or patient base line. Further, the spectrum can be analyzed against known pathogens, such as bacteria, viruses, fungi and other parasites, to determine the existence and type of pathogen in the body. Further, prion infections can be detected by determining an increased production of certain types of proteins, to replace those damaged by the prions. In accordance with a preferred embodiment of the invention, EST determination is also automated by incorporating a DNA chip into the diagnosis device. The biological sample is then more directly inputted into the diagnosis device.

It should be appreciated that the above-described method of database reduction is especially suitable for use with DNA chips, since the method works well even with short nucleotide sequences, which are what DNA chips are usually set to detect.

The automatic determination of alternative spliced regions and variants adds another dimension to the mRNA spectrum. Instead of comparing only the relative amounts of certain types of mRNA, it is also possible to compare the changes in the distributions of the alternative spliced variants (the ratio between protein types). It is hypothesized that the alternative splice regions are differentially transcribed as a function of stresses and diseases of the cell. In accordance with a preferred embodiment of the invention, cell activity is diagnosed by differential analysis of alternative spliced variant distribution in a single tissue type. Preferably, alternative splicing spectra are maintained for a plurality of tissue types, diseases and/or pathogens.

In accordance with a preferred embodiment of the invention, the mRNA assembly technique described herein is used for research purposes. One important type of research is using the mRNA sequences as drug leads. The mRNA sequences may describe whole new genes or they may be useful as probes for detecting new genes. The new genes may also be used for screening purposes, for example to develop and/or discover useful pharmaceuticals and/or to detect genetic diseases and/or detect pathogens which include the mRNA sequence. Also, the mRNA sequences may encode proteins which have various uses, for example as pharmaceuticals. Also, the mRNA sequences themselves may be useful, for example as anti-sense molecules and/or as part of gene-therapy. Since the number of mRNA sequences generated by preferred embodiments of the invention is much smaller than the original number of ESTs, the number of leads which are generated is much reduced, allowing a more focused drug search. In addition, by using a process in accordance with some preferred embodiments of the present invention, the resulting mRNA sequences are longer and contain fewer errors, so a significant amount of laboratory work can be avoided. Further, some laboratory work is avoided by providing protein information derived by matching the mRNA sequences to protein domain and/or family databases.

In accordance with another preferred embodiment of the invention, the mRNA sequence with alternative splicing is compared to the originating genome, to more exactly determine the alternative spliced regions. By such comparison it is possible to correct errors in the identification of alternative spliced regions, since alternative spliced regions are usually contained and/or delimited by introns in the DNA. The originating genome for the mRNA sequence can be more correctly determined using an mRNA sequence than by using only an EST. Further, the identification of alternative spliced regions assists the search task, by indicating where a break in the correspondence may be expected. In addition, if such a search results in more than one site for a certain mRNA sequence, this result is more dependable than when multiple sites are found using ESTs. It is hoped that by comparing mRNA sequences with alternative spliced variants to DNA sequences, the mechanism of alternative splicing will be deciphered.

In accordance with a preferred embodiment of the invention, a DNA chip is designed to detect differences in the relative expression levels of alternative spliced variants of a single mRNA sequence. Such a DNA chip may be designed by first applying the above described methodology to an EST database to determine alternative spliced regions which may be of interest and then designing a DNA chip which detects all the different variants. Diagnosis can then proceed by comparing the relative expression levels with known spectra of different tissue types and/or diseases and/or by detecting changes in the spectra which may be associated with disease and/or certain types of stress.

In accordance with another preferred embodiment of the invention, a DNA chip is designed responsive to the relative distribution of short mRNA sequences in the database. Once the database is reduced to mRNA sequences, using above-described processes, a minimum number of short DNA sequences, which uniquely identify a maximum number of mRNA sequences, is preferably determined. These methods are not useful on EST databases, since there is too much redundancy and too many errors to guarantee a usable subset. Alternatively, the DNA sequences may be selected to generate a minimum error level during identification of a particular group of mRNA sequences. Alternatively, identification of a maximum number of mRNA sequences is the goal. Alternatively, identification of a certain subset of mRNA sequences is the goal. The mRNA sequences to be identified may be selected based on differential analysis and/or based on genomic and/or mRNA mapping of pathogens. By correctly selecting the mRNA sequences, a single DNA chip may also be used to screen for a plurality of diseases.

In a preferred embodiment of the invention, DNA chip design is greatly accelerated using methods described herein. One important task required for many methodologies of DNA chip design and analysis of data is determining an index of short sequences of DNA in a large database of mRNA sequences, ESTs or DNA sequences. As can be appreciated, the above described n-group indexing method for clustering may be used for this purpose. Further, by repeatedly applying an n-group indexing (preferably without any limitation on distance between n-groups), the mRNA sequences are divided into lists. A list which contains only one mRNA indicates that that mRNA sequence can be uniquely identified using only the n-groups which identify that list. The fewer reindexings needed, the smaller the number of n-groups needed to uniquely identify the mRNA or DNA sequence. Preferably, n is between 10 and 50, more preferably, between 20 and 30, most preferably, about 25.
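One possible reading of the repeated indexing described above is sketched below, purely as an illustration. The sketch assumes a greedy strategy that is not stated in the text: at each round the list of candidate mRNA sequences is narrowed by the n-group of the target that leaves the fewest candidates, until only the target remains. The function name `identifying_n_groups` is hypothetical, and a small n is used in the usage note only to keep the example readable.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def identifying_n_groups(target_id: str, mrnas: Dict[str, str],
                         n: int) -> Optional[List[str]]:
    """Return n-groups of the target whose combined presence singles it out,
    or None if presence of n-groups alone cannot distinguish the target."""
    index = defaultdict(set)  # n-group -> ids of sequences containing it
    for seq_id, seq in mrnas.items():
        for pos in range(len(seq) - n + 1):
            index[seq[pos:pos + n]].add(seq_id)

    target = mrnas[target_id]
    target_groups = sorted({target[p:p + n] for p in range(len(target) - n + 1)})
    if not target_groups:
        return None  # target is shorter than n

    candidates = set(mrnas)
    chosen: List[str] = []
    while len(candidates) > 1:
        # Greedy step (an assumption): pick the n-group leaving the shortest list.
        best = min(target_groups, key=lambda g: len(candidates & index[g]))
        narrowed = candidates & index[best]
        if len(narrowed) == len(candidates):
            return None  # no n-group of the target narrows the list further
        candidates = narrowed
        chosen.append(best)
    return chosen
```

With `mrnas = {"m1": "AAACCC", "m2": "CCCGGG", "m3": "AAACCCGGG"}` and n = 3, the sequence m3 needs two n-groups (one shared only with m1 and one shared only with m2), whereas m1 cannot be singled out at all, since every one of its n-groups also occurs in m3.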

In another example, DNA chip design may require that a probe for a genomic database does not appear in a second database, i.e., a probe for a disease pathogen should not “detect” a naturally expressed mRNA sequence. Also, some sets of probes are more suitable and/or may be more effectively manufactured in a DNA chip setting. In a preferred embodiment of the invention, the indexing may be utilized to determine not only that a probe is unique but also a uniqueness score. The probes may then be sorted in order of uniqueness and the selection of appropriate probes or sets of probes may start from the top of the list. The uniqueness of a probe may be defined as a function of relative expression levels and the number and/or location of mismatched nucleotides between a probe and an existing EST portion. The above indexing method provides, as a side effect, locations of EST portions which are similar to the probe, i.e., where one, two or more n-groups match. For each probe, it is possible to analyze the indexes of n-groups which appear in the probe to determine a score for the probe. It should be noted that even if the score is not precise and some EST portions are missed for a particular probe, this may be compensated for by selecting more than one probe for a particular mRNA sequence. Also, once a probe is selected, the probe may be separately analyzed to determine if its uniqueness is sufficient. In a preferred embodiment of the invention, the uniqueness score of a probe is used when analyzing the results of a DNA chip. Thus, each probe on a DNA chip may be separately evaluated to yield an individual false positive/false negative probability, based, for example, on known or assumed concentrations of nucleotide sequences and/or known affinities of such nucleotide sequences to the probes used.

Another “side effect” of the above described clustering algorithm is the detection of SNPs (Single Nucleotide Polymorphisms). If, for example, ten ESTs overlap and five have one nucleotide at a certain position and five have a second nucleotide at the corresponding position, the different nucleotide is a candidate SNP. Preferably, the determination of whether a single nucleotide difference is also a useful SNP is dependent on statistical considerations, which are preferably an input to the system. Such considerations may include the number of overlapping ESTs, the number of ESTs in which each variant appears, the probability of errors in sequencing and/or the effect of the difference on a protein encoded by the sequence. In a preferred embodiment of the invention, probes for the SNPs are grouped together to form an assay of SNPs useful for genetic mapping. Preferably, a large set of probes is manufactured on a DNA chip. In a preferred embodiment of the invention, the above described methods of determining a uniqueness of a probe are also used to determine the uniqueness of SNP probes.
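The ten-EST example above may be illustrated by the following sketch. It assumes that the overlapping ESTs have already been aligned to common coordinates, with "-" marking positions an EST does not cover, and it uses two simple thresholds as a stand-in for the statistical considerations mentioned above. The names `candidate_snps`, `min_coverage` and `min_minor` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def candidate_snps(aligned: Sequence[str], min_coverage: int = 4,
                   min_minor: int = 2) -> List[Tuple[int, Dict[str, int]]]:
    """Return (position, base counts) for columns that look polymorphic.

    A column is reported when at least min_coverage ESTs cover it and the
    second most frequent base appears in at least min_minor ESTs, so that
    an isolated sequencing error does not produce a candidate.
    """
    if not aligned:
        return []
    length = max(len(row) for row in aligned)
    result = []
    for col in range(length):
        counts = Counter(row[col] for row in aligned
                         if col < len(row) and row[col] in "ACGT")
        if sum(counts.values()) < min_coverage:
            continue
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[1][1] >= min_minor:
            result.append((col, dict(counts)))
    return result
```

Five ESTs reading `ACGTA` and five reading `ACCTA` yield a single candidate at position 2, while a single discordant EST among ten falls below `min_minor` and is treated as a probable sequencing error.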

It should be appreciated that this method of n-group indexing is also useful for other methods of DNA detection which utilize the identification of short DNA sequences to uniquely identify certain genes or mRNA sequences.

EXAMPLES

Attached herewith as an appendix “B” are transcript listings of mRNA sequences and clusters of ESTs, which were generated from a public domain database of a mouse, in accordance with preferred embodiments of the present invention. There are three cluster descriptions, each having the following format:

(a) a short description of the cluster;

(b) a list of the mRNA sequences and the associated ESTs used to generate the sequences;

(c) for each EST alternative spliced variant, a cross-reference listing between the sequence and a consensus of all the ESTs;

(d) a sequence listing of the consensus of all the ESTs, which need not match any particular variant; and

(e) transcriptions of the alternative spliced variants detected for the mRNA sequence.

For example, sequence number 10827 contains, on page B-8, two transcripts, one corresponding to each of the two alternative spliced variants.

The cross-reference listing shown between page B-1 (left column) and page B-2 (left column) shows gaps in the sequence for the 10827_0 variant. These gaps correspond to alternative spliced regions which are part of the 10827_1 variant, as shown on pages B-2 (left column) to B-2 (right column).

In a preferred embodiment of the invention, alternative spliced regions which correspond to graph nodes are displayed using a different color, so that they stand out on a graphical display.

The sequence 15537, on pages B-9 to B-20, contains only one variant, transcribed on page B-20.

The sequence 19101, on pages B-21 to B-26, contains four alternative spliced variants, only two of which are transcribed on page B-26, the rest shown only as part of cross-referencing against the consensus sequence. Additional transcripts of mRNA sequences are shown on pages B-27 to B-31 and/or in Israel patent application 121,806, filed Sep. 21, 1997, the disclosure of which is incorporated herein by reference.

In accordance with a preferred embodiment of the invention, a DNA chip is designed to differentially recognize a particular variant by including in the DNA chip sensor array only those n-groups (where n is preferably 25) which appear only in one variant but not in the other. There is also provided in accordance with a preferred embodiment of the invention, a kit of short DNA sequences, determined from such an analysis of mRNA expression. Such kits are preferably constructed from the variants attached herein in appendix B.
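The selection of n-groups which appear in one variant but not in the other amounts to a set difference, as in the following illustrative sketch. The function names `n_groups` and `variant_specific_n_groups` are hypothetical, and the sequences in the usage note are invented toy strings, not sequences from appendix B.

```python
from typing import Set, Tuple


def n_groups(seq: str, n: int) -> Set[str]:
    """Return the set of all length-n substrings of seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def variant_specific_n_groups(variant_a: str, variant_b: str,
                              n: int = 25) -> Tuple[Set[str], Set[str]]:
    """Return (n-groups only in variant_a, n-groups only in variant_b)."""
    groups_a = n_groups(variant_a, n)
    groups_b = n_groups(variant_b, n)
    return groups_a - groups_b, groups_b - groups_a
```

For a toy variant `AAACCCGGG` and a second variant `AAACCCTTTGGG` that includes an extra region, with n = 4, the n-groups specific to the shorter variant are those spanning the junction at which the region is skipped (`CCCG`, `CCGG`, `CGGG`), while those specific to the longer variant overlap the extra region.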

It will be appreciated by a person skilled in the art that the present invention is not limited by what has thus far been described. Rather, the present invention is limited only by the claims which follow.

1. A method of comparing nucleic acid sequences being ESTs included in a first database of sequences and nucleic acid sequences included in a second database of sequences to form groups of sequences from the two databases that all relate to the same gene, the method comprising: for each one or more n-groups of sequences of one of the two databases: (One) associating therewith lists of nucleic acid sequences, each from one of said two databases, each sequence on the list containing the n-groups; and (Two) matching sequences on the lists to generate said group.
 2. A method for obtaining an mRNA sequence having alternative spliced variants from a database of ESTs, comprising: providing a raw database comprising a plurality of ESTs; and assembling ones of said ESTs into mRNA sequences using the method of claim 1, wherein said assembling includes identifying alternative spliced regions.
 3. A method according to claim 2, comprising clustering ESTs which have matching segments and wherein said assembly comprising assembling ESTs which are clustered together.
 4. A method according to claim 2, comprising correcting errors in said ESTs.
 5. An mRNA sequence determined by the process of claim 4.
 6. An mRNA sequence according to claim 5, wherein the sequence comprises at least two alternative spliced regions.
 7. An mRNA sequence according to claim 5, wherein the sequence comprises at least three alternative spliced regions.
 8. An mRNA sequence according to claim 5, wherein the sequence comprises at least four alternative spliced regions.
 9. An mRNA sequence according to claim 7, wherein the sequence represents at least two alternative spliced variants of mRNA sequence, each variant utilizing at least one mutually exclusive alternative splice region.
 10. An mRNA sequence according to claim 7, wherein the sequence represents at least three alternative spliced variants of mRNA, each variant utilizing at least one mutually exclusive alternative splice region.
 11. An mRNA sequence according to claim 7, wherein the sequence represents at least four alternative spliced variants of mRNA, each variant utilizing at least one mutually exclusive alternative splice region.
 12. An mRNA sequence according to claim 7, wherein the mRNA sequence is obtained from a single tissue type.
 13. A method of mRNA assembly from a plurality of ESTs, comprising: determining a correspondence between segments in each EST according to the method of claim 1; and generating a directed graph in which each node represents a single segment, and each transition between two nodes represents the existence of an EST in which the two corresponding segments are consecutive.
 14. A method according to claim 13, comprising clustering said ESTs into clusters of associated ESTs, wherein said determining a correspondence is performed on individual clusters of ESTs.
 15. A method according to claim 13, comprising identifying alternative spliced regions from said graph based on the morphology of the graph.
 16. A method according to claim 13, comprising correcting errors in said ESTs based on said graph based on the morphology of the graph.
 17. A method according to claim 16, comprising repeating said clustering responsive to said corrected errors.
 18. A method of identifying errors in mRNA sequences, comprising: generating a graph which represents the assembly of segments of ESTs into an mRNA sequence; and analyzing said graph to determine unusual configurations of said graph.
 19. A method according to claim 18, wherein said analyzing comprises identifying multiple end-nodes in said graph.
 20. A method of tuning a database reduction process, comprising: applying the database reduction process, with a certain value for at least one parameter, to a sample database; determining a reduction ratio in the database; and reapplying said method with a new value for said at least one parameter if said reduction ratio is not achieved.
 21. A method according to claim 20, wherein said at least one parameter comprises the length of n-groups used in matching two ESTs.
 22. A method of EST database processing, comprising: analyzing said ESTs to detect errors; further processing said ESTs to create mRNA sequences; determining, responsive to said further processing, corrections for said errors; and correcting said errors.
 23. A method according to claim 22, wherein said further processing comprises assembling said ESTs into mRNA sequences.
 24. A method of designing a DNA chip based on an EST set determined by differential analysis of two biological samples, comprising: reducing said EST set to a set of mRNA sequences; analyzing said set of mRNA sequences to determine short mRNA sequences which maximally differentiate said mRNA sequences from mRNA sequences found in both biological samples; and designing a DNA chip which detects said short mRNA sequences.
 25. A method of designing a DNA chip to detect relative expression levels of different variants of mRNA sequences having alternative spliced regions, comprising: reducing an EST database to determine an mRNA sequence having alternative spliced regions; enumerating short DNA sequences which are only included in the alternative spliced regions of said different variants; and designing a DNA chip which detects said short DNA sequences.
 26. A DNA chip designed according to the method of claim 24.
 27. A method of designing a DNA chip, comprising: indexing an mRNA database to determine the indexing of short DNA sequences in the mRNA database, which short DNA sequences are of a length suitable for detection by a DNA chip; determining from said indexing a set of short DNA sequences which uniquely identify a desired mRNA sequence; and designing a DNA chip which detects said set of short DNA sequences.